Abstract

We study the statistical properties of SIR epidemics in heterogeneous networks,
when an epidemic is defined as only those SIR propagations that reach or exceed a
minimum size $s_c$. Using percolation theory to calculate the average fractional size
$\langle M_{SIR}\rangle$ of an epidemic, we find that the strength of the spanning link
percolation cluster $P_\infty$ is an upper bound to $\langle M_{SIR}\rangle$. For small values of
$s_c$, $P_\infty$ is no longer a good approximation, and the average fractional size has to
be computed directly. The value of $s_c$ for which $P_\infty$ is a good approximation is
found to depend on the transmissibility $T$ of the SIR. We also study $Q$, the
probability that an SIR propagation reaches the epidemic mass $s_c$, and find that it
is well characterized by percolation theory. We apply our results to real networks
(DIMES and Tracerouter) to measure the consequences of the choice of $s_c$ on
predictions of average outcome sizes of computer failure epidemics.
The study of disease spread has seen renewed interest recently [1,2,3] due to the
emergence of new lethal infectious diseases such as AIDS and SARS [1,5]. New
tools, ranging from powerful computer models [6] to new conceptual developments
[1,7,8,9,10,11], have emerged in hopes of understanding and addressing
the problem effectively.

Among the new tools that have become available to tackle infectious disease
propagation, complex network theory [12,13] has seen considerable interest [5,2]
as a way to address the shortcomings of more classic approaches [1],
where all individuals in the population of interest are assumed to have an
equal probability to infect all other individuals (random mixing). In contrast
to the random-mixing approach, complex networks (heterogeneous mixing) assume
that each individual (represented by a node) has a defined set of contacts
(represented by links) to other specific individuals (called neighbors), and
infections can be propagated only through these contacts. This new technical
framework has produced novel insights that are expected to help considerably
in the fight against infectious diseases PI5] .

The use of complex network theory requires a few pieces of information in 
order to be correctly applied. First, it is important to understand the kind 
of disease being considered, as this will dictate the specifics of the network 
model that needs to be used. For example, the flu virus usually spreads among 
people that come in contact even briefly, leading to networks with fat-tailed 
distributions of connections with large average degree [5]. On the other hand, 
sexually transmitted diseases are better described by sparser and fairly
heterogeneous contact networks [1]. These two examples illustrate
one of the complications of the problem: the choice of the network structure to be
used. Other aspects involve the life cycle of the pathogen, seasonality, etc.
Additionally, social and practical aspects involving public health policy and 
strategic planning play important roles in the problem. 

Regarding the issue of network structure, a few models have been proposed as
useful substrates for disease propagation. Among these, truncated scale-free
network structures [2] have received considerable interest [8,11]. In these
networks, each node has a probability $P(k)$ to have $k$ links (degree $k$) connecting
to it, with $P(k)$ being characterized by the form

$$P(k) \sim k^{-\lambda} \exp(-k/\kappa), \qquad (1)$$

with $k \geq k_{min}$, where $k_{min}$ is the lowest degree that a node can have and $\kappa$ is an
arbitrary degree cutoff reflecting the properties of the substrate network for
the disease [14]. The reason for including the exponential cutoff is twofold:
first, many real-world graphs appear to show this cutoff; second, it makes the
distribution normalizable for all $\lambda$, and not just $\lambda > 2$ [15].
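
As an illustration of the truncated scale-free form described above, the following minimal sketch tabulates a normalized $P(k) \sim k^{-\lambda} \exp(-k/\kappa)$ and draws a degree sequence from it. The function names and the numerical truncation `k_max` are assumptions of this sketch, not part of the original work.

```python
import math
import random

def truncated_power_law_pmf(lam, kappa, k_min=1, k_max=1000):
    """Return (degrees, probabilities) for P(k) ~ k^-lam * exp(-k/kappa), k >= k_min.

    k_max is a numerical truncation of the normalization sum (assumption of
    this sketch); the exponential cutoff makes the neglected tail negligible
    when k_max is much larger than kappa.
    """
    degrees = list(range(k_min, k_max + 1))
    weights = [k ** (-lam) * math.exp(-k / kappa) for k in degrees]
    norm = sum(weights)
    return degrees, [w / norm for w in weights]

def sample_degrees(n, lam, kappa, k_min=1, seed=None):
    """Draw n degrees from the truncated power law.

    The total degree is forced to be even so that link ends can later be paired.
    """
    rng = random.Random(seed)
    degrees, probs = truncated_power_law_pmf(lam, kappa, k_min)
    sample = rng.choices(degrees, weights=probs, k=n)
    if sum(sample) % 2 == 1:
        sample[0] += 1  # smallest possible change that fixes the parity
    return sample
```

For example, `sample_degrees(10**4, 2.0, 10.0, seed=1)` gives a sequence with the parameter values $\lambda = 2$, $\kappa = 10$, $k_{min} = 1$ quoted in the text.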

Another important issue of propagation relates to the type of disease being
considered and its dynamics. In this sense, a general model for a number of
diseases (including the ones mentioned at the beginning) is the SIR model,
which separates the population into three groups: susceptible, infected, and
recovered (or removed), approximating well the characteristics of many
microparasitic diseases [3]. The solution to the SIR model corresponds to the
determination of the number of susceptible, infected, and recovered individuals
at a given time. Public health officials are particularly interested in the
final outcome of the disease propagation, measured through the number of
individuals $S_{SIR}$, out of a population of $N$, that became infected at any time.
Another useful way to express the solution of the model is through the average
fraction of infected individuals $\langle M_{SIR}\rangle = \langle S_{SIR}/N\rangle$, where
$\langle\cdot\rangle$ denotes averages over realizations.

A number of details related to SIR determine the methods that correctly yield
$S_{SIR}$ [8,11]. One common formulation of SIR assumes that on each time step
an infected node has a probability $\beta$ to infect any of its susceptible neighbors,
and once infected the node recovers in exactly $t_R$ time steps. This yields an
overall probability $T$, called the transmissibility, to use any given network
link of a node that becomes infected. For this case, when the networks have
very simple structure [16], $\langle M_{SIR}\rangle$ can be determined using a mapping to the
link percolation model [312] of statistical physics [17] (see below). If the SIR
propagation details change, modified forms of percolation may be used [8,11].

From the standpoint of public health policy and strategic planning, an important
technical point is how to "define" what is considered to be an epidemic,
because such a definition determines the level of reaction that health
organizations (e.g., the World Health Organization) will apply in dealing with a
particular infectious disease event. In real-world disease spread situations, as
pointed out in several references [2,8,11], epidemiologists are obliged to define a
minimum number of people infected, or threshold $s_c$, to distinguish between a
so-called outbreak (a small number of individuals where no large intervention is
called for) and an epidemic (a significant number of individuals in the population
requiring large-scale intervention). In Refs. [2,8,11], for instance, $s_c$ has been
used, but its impact on average predictions of SIR has not been systematically
addressed, even though it is representative of the sensitivity, or urgency, that
epidemiologists assign to the disease in question.

In this paper we address the importance of $s_c$ for SIR in complex networks.
Using link percolation, we first concentrate on calculating the average fractional
size $\langle M_{SIR}(T, s_c)\rangle$ over SIR model realizations for which $S_{SIR} \geq s_c$. This
quantity is important in the public health community to determine the average
expectation value for the epidemic size that can arise given the particular
pathogen and society affected, and the epidemic threshold $s_c$ chosen.
To calculate SIR through link percolation, we find that a reweighting procedure,
previously ignored, is necessary. Once this reweighting
is done, $\langle M_{SIR}(T, s_c)\rangle$ for large $s_c$ approaches $P_\infty(T)$, corresponding to
the average fractional size of the largest percolation cluster at $T$, but for $s_c$
smaller than a value that depends on the topology of the network, we find
that $\langle M_{SIR}(T, s_c)\rangle < P_\infty(T)$ for $T < 1$, indicating that the percolation result
for $P_\infty$ is an upper bound. Since the choice of $s_c$ determines what is defined
to be an epidemic, we also determine $Q = Q(T, s_c)$, the probability that an
SIR realization reaches $S_{SIR} \geq s_c$. Extending our results to situations such as
computer networks, where one should be able to declare an epidemic even if
few computers are infected because of the "similarity" of the world population
of computers (i.e., sharing the same operating system), which therefore have large
susceptibility, we find that similar results apply.

The rest of the article is structured as follows. Section 1 introduces details
of the network model and where it applies, the link percolation method used
to solve the SIR model, and the details of the reweighting procedure necessary
to obtain correct averages. Sections 2 and 3 introduce and explain the
results of the application of the model to disease propagation events in simulated
networks and real-world examples (computer networks). Finally, Sec. 4
summarizes the results of the paper and presents our conclusions.



1 Models and algorithm 

To construct networks of size $N$ we use the Molloy-Reed algorithm [18], and
apply it to the degree distribution given by Eq. (1). Simulations for this type
of network have been performed before in Refs. [2] and [5] for $N = 10^4$ and
$10^5$, $\lambda = 2$, $k_{min} = 1$, $\kappa = 5, 10, 20$, and $s_c = 100$ and $200$ [20]. We perform our
simulations for many values of $\kappa$ but we present our results only for $\kappa = 10$.
Our main results also hold for other degree distributions. Because
the lowest degree is $k_{min} = 1$ [21] and $\kappa$ is small, the network is very fragmented,
and the size of the initial biggest connected cluster (GC), labeled here as $N_{GC}$,
is typically 60% of the network (for $\kappa = 10$). For all our simulations we work
only on the GC of the original network because we are only concerned with
disease spread on connected communities: a disease seeded in an isolated cluster
cannot propagate beyond it.
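
The construction step can be sketched as follows. This is a minimal stub-pairing version of the Molloy-Reed idea followed by a breadth-first search for the GC; discarding self-loops and repeated links is a simplification assumed here, and the function names are illustrative only.

```python
import random
from collections import deque

def molloy_reed(degree_sequence, seed=None):
    """Pair link ends (stubs) uniformly at random.

    Self-loops and repeated links are silently dropped (assumption of this
    sketch), so a node can end up with slightly fewer links than requested.
    Returns an adjacency list as a list of sets.
    """
    rng = random.Random(seed)
    stubs = [node for node, k in enumerate(degree_sequence) for _ in range(k)]
    rng.shuffle(stubs)
    adj = [set() for _ in degree_sequence]
    for a, b in zip(stubs[::2], stubs[1::2]):
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def giant_component(adj):
    """Return the node set of the largest connected cluster (the GC)."""
    seen = set()
    best = set()
    for start in range(len(adj)):
        if start in seen:
            continue
        comp = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    queue.append(v)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best
```

All later steps (SIR runs and link percolation) would then be restricted to the node set returned by `giant_component`.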

To simulate SIR, we choose one node at random on the GC of the substrate
network, and infect it. Per time step, this infected node has a probability $\beta$
to infect each of its first neighbors. Once a neighbor has been infected, it can infect
any of its own susceptible neighbors, but it cannot be infected again nor infect
another already infected or recovered node. All infected nodes recover
$t_R$ time steps after becoming infected [22]. The transmissibility $T$ is the overall
probability that a node infects one of its susceptible neighbors within the
time frame $t = 1$ to $t_R$, given by
$T = \sum_{t=1}^{t_R} \beta (1 - \beta)^{t-1} = 1 - (1 - \beta)^{t_R}$. For every
realization of SIR, the total number of nodes that become infected by the time
the infectious transmission has ended is given by $S_{SIR}$. The values of $S_{SIR}$ follow
a distribution $\Phi(S_{SIR})$.
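
A minimal sketch of this discrete-time SIR rule is given below, assuming the network is stored as an adjacency list (a list of sets, an assumption of this sketch). `transmissibility` evaluates the closed form for $T$, and `sir_final_size` returns $S_{SIR}$ for one realization; both names are illustrative.

```python
import random

def transmissibility(beta, t_r):
    """T = sum_{t=1}^{t_R} beta (1 - beta)^(t-1) = 1 - (1 - beta)^t_R."""
    return 1.0 - (1.0 - beta) ** t_r

def sir_final_size(adj, seed_node, beta, t_r, rng):
    """Run one SIR realization and return S_SIR, the number of nodes ever infected."""
    remaining = {seed_node: t_r}   # infected node -> time steps left before recovery
    ever_infected = {seed_node}
    while remaining:
        newly = []
        for u in remaining:
            for v in adj[u]:
                # only susceptible neighbors can be infected, each with probability beta
                if v not in ever_infected and rng.random() < beta:
                    ever_infected.add(v)
                    newly.append(v)
        # nodes that have used all t_R steps recover; the rest keep infecting
        remaining = {u: t - 1 for u, t in remaining.items() if t > 1}
        for v in newly:
            remaining[v] = t_r     # new infections start transmitting on the next step
    return len(ever_infected)
```

With $\beta = 0.05$, as used for Fig. 1, varying `t_r` sweeps $T$ over a wide range; for instance `transmissibility(0.05, 10)` is about 0.40.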

As mentioned in the introduction, another way to calculate $S_{SIR}$ is through
the use of link percolation. This is a process in which an initial network is
modified by removing a fraction $1 - T$ of its links (we use $T$ as the probability
for a link to be present because of the mapping between link percolation
and our SIR model). The effect of the removal is to generate a multitude of
clusters, each being a group of nodes that can be reached from each other by
following a sequence of edges connected to those nodes. Link percolation has a
threshold value $T = T_c$ (the percolation threshold), characterized by the fact
that, for $T < T_c$, the size of the largest cluster typically scales as $\log N$, and
for $T > T_c$, a large cluster emerges with a size that scales linearly with $N$,
alongside a number of small clusters. Thus, a so-called percolation transition
occurs at $T = T_c$ that takes the network from disconnected to connected. In
general terms, a similar situation occurs in SIR, where a high likelihood of
transmission of the disease (large $T$) between neighbors typically leads to a
large epidemic, but if this likelihood is low (small $T$), only small localized
outbreaks appear (a detailed description of the relation is developed below).

To perform link percolation, we begin with the GC of the substrate network,
and randomly eliminate links with probability $1 - T$. Each realization of this
process yields multiple connected clusters of various sizes. Realizations are
then repeated multiple times, and a distribution of cluster sizes $\phi(S_p)$ emerges.
For the quantity $P_\infty(T)$, we average over the largest cluster size produced in
each realization.
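
One way to carry out this procedure is sketched below with a union-find structure; this is an implementation choice of the sketch, not necessarily the method used by the authors. Node labels are assumed to be integers so that each undirected link is tested exactly once, and `p_infinity` is a hypothetical helper name.

```python
import random
from collections import Counter

def percolation_cluster_sizes(adj, nodes, T, rng):
    """Keep each link among `nodes` with probability T; return the cluster sizes."""
    parent = {u: u for u in nodes}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u

    for u in nodes:
        for v in adj[u]:
            # u < v visits every undirected link once; links leaving `nodes` are ignored
            if u < v and v in parent and rng.random() < T:
                ru, rv = find(u), find(v)
                if ru != rv:
                    parent[ru] = rv
    return list(Counter(find(u) for u in nodes).values())

def p_infinity(adj, nodes, T, realizations, rng):
    """Average fractional size of the largest cluster over several realizations."""
    total = 0
    for _ in range(realizations):
        total += max(percolation_cluster_sizes(adj, nodes, T, rng))
    return total / (realizations * len(nodes))
```

Pooling the lists returned by `percolation_cluster_sizes` over many realizations gives an empirical estimate of $\phi(S_p)$.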

The relation between SIR and link percolation can be concretely explained in
the following way: each SIR realization begins with a randomly chosen node
of the GC, and the infection propagates to a set of $S_{SIR}$ nodes that can all be
traced back to the original infection. The links used in this SIR realization, on
average, were used with probability $T$ and not used with probability $1 - T$.
To draw the correct connection to link percolation, we first must realize that
in a given realization of percolation, only one of the many connected clusters
can be chosen to represent the infection of SIR. By analogy with the classic
Leath algorithm [23] of cluster creation in percolation, we can conclude that
the clusters are randomly picked with probability proportional to their size
$S_p$, because a uniformly chosen seed node falls in a given cluster with probability
proportional to the number of nodes in it. Thus, one expects that the average
size of SIR realizations is equivalent to a weighted average of percolation
realizations, where the weight is given by $S_p$.

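With the previous arguments in mind, and given the dependence of the problem
on both $T$ and $s_c$, we compute $\langle M_{SIR}(T, s_c)\rangle$ through

$$\langle M_{SIR}(T, s_c)\rangle = \frac{\sum_{S_{SIR} \geq s_c} S_{SIR}\, \Phi(S_{SIR})}{\langle N_{GC}\rangle \sum_{S_{SIR} \geq s_c} \Phi(S_{SIR})}. \qquad (2)$$

In order to compare this to link percolation, we perform a weighted average
to obtain $\langle M_p(T, s_c)\rangle$, given by

$$\langle M_p(T, s_c)\rangle = \frac{\sum_{S_p \geq s_c} S_p^2\, \phi(S_p)}{\langle N_{GC}\rangle \sum_{S_p \geq s_c} S_p\, \phi(S_p)}. \qquad (3)$$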
We expect that both averages converge to the same value when enough realizations
are performed. Additionally, as $s_c$ is increased, we expect
$\langle M_p(T, s_c \gg 1)\rangle \to P_\infty(T)$ for $T > T_c$, because a progressively smaller
number of small clusters enters into the averaging, and only the largest clusters
are used. This creates an interesting scenario, in which $P_\infty(T)$ is a good
approximation of the epidemic size only in the limit of a large threshold
$s_c > S_p^{\times}$ (a function of $T$ only, defined below), but for smaller $s_c$, which is
important in more aggressive diseases, only $\langle M_p(T, s_c)\rangle$ is the correct average.
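
The two averages can be sketched in a few lines, assuming Eq. (2) is read as a plain average of $S_{SIR}/N_{GC}$ over realizations with $S_{SIR} \geq s_c$, and Eq. (3) as the same average over clusters with each cluster weighted by its size $S_p$. The inputs are plain lists of sizes pooled over realizations; the function names are illustrative.

```python
def mean_fraction_sir(sir_sizes, s_c, n_gc):
    """Eq. (2) as read here: average of S_SIR / N_GC over realizations with S_SIR >= s_c."""
    kept = [s for s in sir_sizes if s >= s_c]
    return sum(kept) / (n_gc * len(kept)) if kept else float("nan")

def mean_fraction_percolation(cluster_sizes, s_c, n_gc):
    """Eq. (3) as read here: S_p-weighted average of S_p / N_GC over clusters with S_p >= s_c."""
    kept = [s for s in cluster_sizes if s >= s_c]
    weight = sum(kept)                      # sum of the weights S_p
    return sum(s * s for s in kept) / (n_gc * weight) if kept else float("nan")
```

As a toy check, a 10-node network split into clusters of sizes 8, 1 and 1 gives a weighted average of 0.66 for $s_c = 1$ and 0.8 for $s_c = 2$, the latter equal to the fractional size of the largest cluster.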



2 Results on the relative average size of the disease 

2.1 Mapping between the average fraction size using SIR simulations and the
average fraction size of all percolation clusters

As a first step, we test that $\langle M_{SIR}(T, s_c)\rangle$ and $\langle M_p(T, s_c)\rangle$ are indeed equal.
In Fig. 1 we plot $\langle M_{SIR}(T, s_c)\rangle$ and $\langle M_p(T, s_c)\rangle$ to check their agreement. The
two curves overlap, indicating that the mapping between the two quantities is
correct. In the remainder (unless explicitly stated), we perform our simulations
using link percolation as opposed to SIR.

The mapping between the steady state of SIR and link percolation is computationally
very convenient for several reasons. First, simulating
SIR models is computationally more costly than link percolation, because
for SIR only a single propagation occurs per realization, as opposed
to the multiple clusters that appear for link percolation. Additionally, SIR
propagation has to be performed dynamically, which makes it necessary
to test a propagation condition at every time step, something that does not
occur for link percolation, further accelerating the simulations. Finally, this
mapping is convenient because it gives another conceptual framework in which
to understand the relation between disease propagation
and percolation models.

A final feature of Fig. 1 is the plot of $P_\infty(T)$. This curve displays good agreement
with $\langle M_{SIR}(T, s_c)\rangle$ for the larger $s_c$. We discuss this issue further in the
next subsection.



2.2 Effects of $s_c$ on the average size of epidemics

In Fig. 2(a), we plot $\langle M_p(T, s_c)\rangle$ to explore the effect of $s_c$ on this average.
We can see from the plot that only for larger $s_c$ (for our simulation
parameters $\approx 200$) the curves of $P_\infty(T)$ and $\langle M_p(T, s_c)\rangle$ coincide for $T > T_c$
($T_c \approx 0.34$ for $N = 10^5$), while for smaller $s_c$ values they do not. The need to
use large $s_c$ to approach $P_\infty(T)$ had been realized previously [2,8], but had not
been commented on in any detail. We can see this behavior more clearly in
Fig. 2(b), where we plot $P_\infty(T) - \langle M_p(T, s_c)\rangle$ for different values of $s_c$ and find
that $P_\infty(T)$ is an upper bound of $\langle M_p(T, s_c)\rangle$, except for very large $s_c$ (see
Ref. [25]). From the inset of Fig. 2(b), we can see that the difference reaches
approximately 3% for large values of $s_c$.

The choice of $s_c$ has an extra consequence, which is to change the likelihood
that a given pathogen propagation is declared an epidemic. This probability
is relevant from the standpoint of readiness, because a lower $s_c$ implies
that almost any disease propagation is more likely to be considered as reaching
the epidemic state. Thus, we define $Q$, which represents the probability that
an SIR with transmissibility $T$ has size $S_{SIR} \geq s_c$. This quantity can be computed
directly as the number of times $S_{SIR} \geq s_c$ divided by the total number
of realizations (see Fig. 3). Analytically, $Q$ can be related to $\Phi(S_{SIR})$ through



$$Q = \frac{\sum_{S_{SIR} \geq s_c} \Phi(S_{SIR})}{\sum_{S_{SIR} \geq 1} \Phi(S_{SIR})} = \sum_{S_{SIR} \geq s_c} \Phi(S_{SIR}), \qquad (4)$$

where the last equality is a consequence of normalization. In order to calculate
$Q$ from the percolation results, we keep in mind the reweighting applied in
Eq. (3). Then, $Q$ is given by



$$Q = \frac{\sum_{S_p \geq s_c} S_p\, \phi(S_p)}{\sum_{S_p \geq 1} S_p\, \phi(S_p)}, \qquad (5)$$

where $\sum_{S_p \geq 1} S_p\, \phi(S_p) = \langle N_{GC}\rangle$. In Fig. 3 we plot $Q$ for SIR for $T = 0.4$
($T > T_c$), using direct computation, and compare it with the results obtained
using Eq. (5). We can see that the agreement is excellent. In order
to understand the scaling behavior of $Q$, we first consider the details of
$\phi(S_p)$. From percolation theory it is known that, for $T$ close to and above $T_c$,
$\phi(S_p) \sim A S_p^{-\tau} \exp(-S_p/S_p^{\times}) + F(S_p - S_p^{\infty})$, where $\tau$ has the mean-field value
$5/2$. In the last expression, $S_p^{\times}$ is a characteristic maximum finite cluster size
which scales as $|T - T_c|^{-\sigma}$ ($\sigma = 2$), $A$ is a measure of the relative statistical
weight between the two terms (estimated below), $F$ is a narrow function of
its argument, and $S_p^{\infty} = S_p^{\infty}(T) = \langle N_{GC}\rangle P_\infty(T)$.
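
The two routes to $Q$ compared in Fig. 3 can be sketched as follows, assuming the SIR outcome sizes and the percolation cluster sizes are available as plain lists pooled over realizations. The function names are illustrative.

```python
def epidemic_probability_sir(sir_sizes, s_c):
    """Direct estimate: fraction of SIR realizations with S_SIR >= s_c."""
    return sum(1 for s in sir_sizes if s >= s_c) / len(sir_sizes)

def epidemic_probability_percolation(cluster_sizes, s_c):
    """Eq. (5): mass in clusters with S_p >= s_c divided by the total mass."""
    return sum(s for s in cluster_sizes if s >= s_c) / sum(cluster_sizes)
```

For clusters of sizes 8, 1 and 1, a uniformly chosen seed lands in the large cluster with probability 0.8, which is what `epidemic_probability_percolation([1, 1, 8], 2)` returns.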

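To calculate $Q$, we use $\phi(S_p)$ and Eq. (5), and assume the continuum limit
over $S_p$, giving

$$
\begin{aligned}
Q &\approx \frac{1}{\langle N_{GC}\rangle} \int_{s_c}^{\infty} \left[ A S_p^{-\tau+1} \exp(-S_p/S_p^{\times}) + S_p F(S_p - S_p^{\infty}) \right] dS_p \\
&\approx
\begin{cases}
\dfrac{A \left[ s_c^{-\tau+2} - (S_p^{\times})^{-\tau+2} \right]}{\langle N_{GC}\rangle (\tau - 2)} + \dfrac{1}{\langle N_{GC}\rangle} \displaystyle\int_{s_c}^{\infty} S_p\, \delta(S_p - S_p^{\infty})\, dS_p & [s_c < S_p^{\times}] \\[2ex]
\dfrac{S_p^{\infty}}{\langle N_{GC}\rangle} & [S_p^{\times} < s_c < S_p^{\infty}] \\[2ex]
0 & [S_p^{\infty} < s_c],
\end{cases}
\end{aligned}
\qquad (6)
$$
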

where we approximated the first term of the integral by truncating the integration
at $S_p^{\times}$, and simplified $F$ to a delta function (of integral 1, which relates
to the value of $A$). Several $Q$ regimes can be identified: (i) for $s_c \ll S_p^{\times}$,
the contribution of $(S_p^{\times})^{-\tau+2}$ is negligible and therefore $Q \sim s_c^{-\tau+2}$; (ii) for
$s_c \sim S_p^{\times}$, $Q$ becomes dominated by a competition between the two terms of
the integral and no clear scaling rules apply; (iii) for $S_p^{\times} < s_c < S_p^{\infty}$,
$Q \propto S_p^{\infty}$; and (iv) for $s_c > S_p^{\infty}$, $Q \to 0$. From Fig. 3 we can identify those four regimes.
In the figure the arrow represents approximately $S_p^{\infty}/\langle N_{GC}\rangle \approx 0.12$ from the
simulation. The agreement between the theoretical scaling (see Eq. (6)) and
the simulation is excellent.
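
The power-law regime (i) can be made explicit with one integration step. The sketch below assumes, as in the truncation described above, that the exponential factor is replaced by 1 for $S_p < S_p^{\times}$ and by 0 beyond it, and then inserts the mean-field value $\tau = 5/2$.

```latex
\int_{s_c}^{S_p^{\times}} A\, S_p^{-\tau+1}\, dS_p
  = \frac{A}{\tau-2}\left[ s_c^{-\tau+2} - (S_p^{\times})^{-\tau+2} \right]
  \;\xrightarrow{\ \tau = 5/2\ }\;
  2A\left[ s_c^{-1/2} - (S_p^{\times})^{-1/2} \right]
```

For $s_c \ll S_p^{\times}$ the second term in the bracket is small compared with the first, which leaves the decay $Q \sim s_c^{-1/2}$.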

Moreover, the value of $A$ can be estimated from the fact that, for a system of size
$\langle N_{GC}\rangle$, the first term of $\phi(S_p)$ accounts for the finite clusters present, and the
integral of $S_p\, \phi(S_p)$ must be equal to the mass of the finite clusters. Therefore

$$[\langle N_{GC}\rangle - S_p^{\infty}(T)] \sim A \int_1^{S_p^{\times}} S_p^{-\tau+1} \exp(-S_p/S_p^{\times})\, dS_p \;\Rightarrow\; A \sim \frac{(\tau - 2)\left(\langle N_{GC}\rangle - S_p^{\infty}(T)\right)}{1 - (S_p^{\times})^{-\tau+2}}. \qquad (7)$$

Since the rest of the mass of the network is contained in a single spanning
cluster, the relative weight of the first to second term of $\phi(S_p)$ is $A : 1$,
justifying the choice of the integral of $F$ to be 1. The effects shown here also hold
for other networks, including real networks, as shown below.
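
The estimate of $A$ and the piecewise form of $Q$ can be evaluated numerically as a consistency check. The sketch below assumes the truncated-integral approximation of Eqs. (6) and (7); the function names are illustrative and the value of $S_p^{\times}$ used in the usage note is hypothetical.

```python
def amplitude_A(n_gc, s_inf, s_cross, tau=2.5):
    """Eq. (7): A ~ (tau - 2)(N_GC - S_inf) / (1 - S_cross^(-tau + 2))."""
    return (tau - 2.0) * (n_gc - s_inf) / (1.0 - s_cross ** (-tau + 2.0))

def q_scaling(s_c, n_gc, s_inf, s_cross, tau=2.5):
    """Piecewise approximation of Q(s_c) following Eq. (6)."""
    if s_c > s_inf:
        return 0.0                           # no cluster is large enough
    plateau = s_inf / n_gc                   # contribution of the spanning cluster
    if s_c >= s_cross:
        return plateau
    a = amplitude_A(n_gc, s_inf, s_cross, tau)
    finite = a * (s_c ** (-tau + 2.0) - s_cross ** (-tau + 2.0)) / (n_gc * (tau - 2.0))
    return finite + plateau                  # finite clusters plus spanning cluster
```

With $\langle N_{GC}\rangle = 6 \times 10^4$ and $S_p^{\infty}/\langle N_{GC}\rangle = 0.12$ as in the text, and a hypothetical $S_p^{\times} = 278$, `q_scaling(1, 60000, 7200, 278)` evaluates to 1 up to rounding, as normalization requires, and the function reaches the plateau 0.12 once $s_c$ exceeds $S_p^{\times}$.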

One final result that can be derived from $\phi(S_p)$ is the value of $s_c$ for which
$P_\infty(T)$ is a good approximation for $\langle M_p(T, s_c)\rangle$. From the previous results, we
note that there is a "gap" in the distribution of sizes between $S_p^{\times}$ and $S_p^{\infty}$,
which means that percolation generates very few clusters between these sizes.
Thus, when determining $\langle M_p(T, s_c)\rangle$, the significant statistical contributions
are concentrated in clusters smaller than $S_p^{\times}$ and then in $S_p^{\infty}$. For $s_c > S_p^{\times}$,
only the latter term contributes, driving $\langle M_p(T, s_c)\rangle \to P_\infty(T)$. It is important
to recognize that this result is independent of the system size $N_{GC}$, but not of
$T$, as $S_p^{\times}$ is a function of $T$.



3 Application to Traceroute and DIMES networks 

The results we have presented for our model of human infectious disease propagation
are applicable to other problems in the real world. This can be well
illustrated for computer networks in which information is being broadcast.

One of the networks that describes the functional connectivity of the Internet
is the Traceroute network, where the nodes are the routers and the links
are the connections between them that transport IP packets. The network, as
measured in Ref. [25], has $N = 222934$ nodes and $L = 279510$ links. This
network can be represented by a scale-free network with $\lambda = 2.1$ [22]. In
order to obtain information on the Internet connectivity, a software probe
called a Tracerouter tool is used, which sends IP packets over the Internet, eliciting
a reply from the targeted host. By combining the information on the packets'
paths to the various destinations, a network of router adjacencies is built [2Tj .
Here, the SIR process can be understood as a router that has a random failure
(Infected), which can produce failures on neighbor nodes that are functional
(Susceptible), and these new nodes become infected. After some time
the router is practically disconnected from the communication network (Removed).
The DIMES network [28] uses the same search algorithm as
the Tracerouter network; the nodes are Autonomous Systems (AS) and the
links are the connections between AS. The network has $N = 20556$ nodes and
$L = 62920$ links. The description of the SIR process over DIMES is the same
as the one explained before for the Tracerouter network.

In Figs. 4 and 5 we plot $P_\infty(T)$ and $\langle M_p(T, s_c)\rangle$ for different values of $s_c$
as a function of $T$. For $s_c = 500$ for Tracerouter and $s_c = 100$ for the DIMES
network, we can map this problem to $P_\infty(T)$ of link percolation. We can see
that the problem maps into $\langle M_p(T, s_c)\rangle$ for any value of $s_c$. We compute $Q$
for both networks; the results are plotted in Fig. 6(a) and (b) for the Tracerouter
and DIMES networks, respectively. For DIMES, $T_c \to 0$, and thus the first region
cannot be seen [17]. On the other hand, if $T_c$ is finite, as in Tracerouter, $Q$ has
the four regions described for model networks (see Eq. (6)).



4 Summary 

We have shown that the choice of $s_c$, the minimum SIR propagation size necessary
to declare an epidemic, has important consequences on epidemiological
predictions. Using percolation theory to calculate the average fractional size
$\langle M_{SIR}(T, s_c)\rangle = \langle M_p(T, s_c)\rangle$ of an epidemic, we find that the strength of the
spanning link percolation cluster $P_\infty(T)$ is an upper bound to $\langle M_{SIR}(T, s_c)\rangle$,
provided $s_c$ does not exceed $S_p^{\infty}(T)$, the typical size of finite clusters of link
percolation, where pathological results can appear. When $s_c$ is between $S_p^{\times}(T)$
and $S_p^{\infty}(T)$, $P_\infty(T)$ is a good approximation to $\langle M_{SIR}(T, s_c)\rangle$. For small values
of $s_c$, $P_\infty$ is no longer a good approximation, and the average fractional
size has to be computed directly. We also study $Q$, the probability that an
SIR propagation reaches the epidemic mass $s_c$, which has several interesting
regimes including one that scales as $s_c^{-\tau+2}$. We apply our results to real networks
(DIMES and Tracerouter) to measure the consequences of the choice of $s_c$
on predictions of average outcome sizes of computer failure epidemics.
Fig. 1. Comparison between $\langle M_{SIR}(T, s_c)\rangle$ (□), $\langle M_p(T, s_c)\rangle$ (O), and $P_\infty(T)$ of link
percolation (full line). Empty symbols correspond to $s_c = 100$, and dotted symbols
to $s_c = 1$. For the transmissibility in the SIR problem, we used $\beta = 0.05$ and a set
of values of the recovery time $t_R$ to cover a wide range of $T$. All the simulations were
performed on the GC of networks with $\lambda = 2$, $\kappa = 10$, $k_{min} = 1$, and averaged over
$10^4$ realizations.
Fig. 2. a) Plot of $\langle M_p(T, s_c)\rangle$ as a function of $T$, for $s_c = 200$ (O), $s_c = 10$ (□)
and $s_c = 1$ (+). The full line represents $P_\infty(T)$. The inset shows the details of
the main plot close to $T_c \approx 0.32$, i.e., for $T$ near the percolation threshold. We
can observe that the departure between $P_\infty(T)$ and $\langle M_{SIR}(T, s_c)\rangle$ is not negligible.
b) $P_\infty - \langle M_p(T, s_c)\rangle$ as a function of $T$, for $s_c = 1$ (□), $s_c = 10$ (*), $s_c = 50$ (+)
and $s_c = 200$ (O). In the inset we plot the details of the main plot around $T_c$ for
$s_c = 10$ (dot-dashed line), $s_c = 50$ (dashed line) and $s_c = 200$ (full line). We observe
that $P_\infty(T)$ is an upper bound for $\langle M_p(T, s_c)\rangle$ [25]. In all the simulations we used
$N = 10^5$, $\lambda = 2$, $\kappa = 10$, $k_{min} = 1$, and the averages were done over $10^3$ realizations
on the GC of networks of size $\approx 0.6N$.
Fig. 3. Plot of $Q$ for: SIR, as a measure of the number of times $S_{SIR} \geq s_c$ divided
by the number of realizations (full line); link percolation over all clusters as in
Eq. (5) (O). We observe that both curves are in good agreement. For small $s_c$, $Q$
has a power-law decaying behavior with exponent $\tau - 2 = 1/2$. The arrow represents
approximately $S_p^{\infty}/\langle N_{GC}\rangle \approx 0.12$ as predicted by theoretical scaling.
Fig. 4. Plot of $\langle M_p(T, s_c)\rangle$ as a function of $T$, for the Tracerouter network that has
$N = 222934$, Links $= 279510$, $P(k) \sim k^{-\lambda}$ with $\lambda = 2.1$, with $s_c$ = (o), $s_c = 2$ (+)
and $s_c = 100$ (□). The full line represents $P_\infty(T)$.
Fig. 5. Plot of $\langle M_p(T, s_c)\rangle$ as a function of $T$, for the DIMES network that has a
scale-free distribution with $\lambda \approx 2.15$, $N = 20556$, Links $= 62920$, for $s_c$ = (o), $s_c = 10$
(+) and $s_c = 500$ (□). The full line represents $P_\infty(T)$.
Fig. 6. $Q$ as a function of $s_c$ for: a) Tracerouter network, with $T = 0.25$ (O); b)
DIMES network, with $T = 0.02$ (O). The exponent of the decreasing power law is
around 0.62, indicating that for this network $\tau \approx 2.62$.